Gossip-based Failure Detection and Consensus for Terascale Computing
Author: Rajagopal Subramaniyan
Abstract
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

GOSSIP-BASED FAILURE DETECTION AND CONSENSUS FOR TERASCALE COMPUTING
By Rajagopal Subramaniyan
May 2003
Chair: Alan D. George
Department: Electrical and Computer Engineering

One promising avenue of research on failure detection for large systems is the use of gossiping and other epidemic techniques to disseminate availability information. Gossip protocols and services provide a way to detect failures in large, distributed systems in an asynchronous manner without the limitations associated with reliable multicasting for group communications. However, the performance of these failure-detection protocols, in terms of metrics like correctness, completeness and scalability, needs to be improved for use in terascale clusters. The consensus algorithm used to detect failures in a system-wide fashion should be resilient enough to detect both individual node and group failures. Gossiping with consensus can take place throughout the system via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. This thesis presents a scalable multilayered gossip protocol, with features to strengthen the completeness and correctness of the failure-detection service. We examine the performance of the layered gossip protocol in terms of scalability on an experimental
Similar resources
DisTriB: Distributed Trust Management Model Based on Gossip Learning and Bayesian Networks in Collaborative Computing Systems
The interactions among peers in Peer-to-Peer systems, as distributed collaborative systems, are based on asynchronous and unreliable communications. Trust is an essential and facilitating component in these interactions, especially in such uncertain environments. Various attacks that affect trust are possible due to the large-scale nature and openness of these systems. Peers do not have enough inform...
Experimental Evaluation of a Failure Detection Service Based on a Gossip Strategy
Failure detectors were first proposed as an abstraction that makes it possible to solve consensus in asynchronous systems. A failure detector is a distributed oracle that provides information about the state of processes of a distributed system. This work presents a failure detection service based on a gossip strategy. The service was implemented on the JXTA platform. A simulator was also imple...
Epidemic Failure Detection and Consensus for Extreme Parallelism
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum’s User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault ...
A Failure Detection Service Based on Epidemic Dissemination for Peer-to-Peer Networks
Failure detectors were first proposed as an abstraction that makes it possible to solve consensus in asynchronous systems. A failure detector is a distributed oracle that provides information about the state of processes of a distributed system. This work presents a failure detection service based on a gossip strategy. The service was implemented on the JXTA platform. A simulator was also imple...
Publication date: 2002